Abstract: This paper lays the foundations for a unified framework for 
numerically and computationally applying methods drawn from a range 
of currently distinct geometrical approaches to statistical modelling. In so 
doing, it extends information geometry from a manifold based approach 
to one where the simplex is the fundamental geometrical object, thereby 
allowing applications to models which do not have a fixed dimension or 
support. Finally, it starts to build a computational framework which will 
act as a proxy for the space of all distributions that can be used, in partic- 
ular, to investigate model selection and model uncertainty. A varied set of 
substantive running examples is used to illustrate theoretical and practical 
aspects of the discussion. Further developments are briefly indicated. 
1. Introduction 

The application of geometry to statistical theory and practice has produced a 
number of different approaches and this paper will involve three of these. The 
first is the application of differential geometry to statistics, which is often called 
information geometry. It largely focuses on typically multivariate, invariant and 
higher-order asymptotic results in full and curved exponential families through 
the use of differential geometry and tensor analysis; key references include [1], 
[6], [7], [30] and [21]. Also included in this approach are consideration of cur- 
vature, dimension reduction and information loss, see [13] and [27]. The second 
important, but completely separate, approach is in the inferentially demanding 
area of mixture modelling, a major highlight being found in [25] where convex 
geometry is shown to give great insight into the fundamental problems of infer- 
ence in these models and to help in the design of corresponding algorithms. The 
third approach is the geometric study of graphical models, contingency tables, 
(hierarchical) log-linear models, and related topics involving the geometry of 
extended exponential families. Important results with close connections to the 
approach in this paper can be found in [34] and [14], while the wider field of 
algebraic statistics is well-reviewed in [32] and [16]. 

This paper has the following four objectives: (1) to use the tool of the ex- 
tended multinomial distribution (see [8], [34], [14] and [11]) to construct a frame- 
work which unifies all of the above geometric approaches; in particular, to show 
explicitly the links between information geometry, extended exponential fami- 
lies and Lindsay's mixture geometry, (2) to show how this unifying framework 
provides a natural home for numerically implementing algorithms based on the 
geometries described above, (3) to extend the results of information geometry 
from the traditional manifold based approach to models which do not have a 
fixed dimension or support, and (4) to start to build a computational framework 
which will act as a proxy for the 'space of all distributions' which can be used, in 
particular, to investigate model selection and model uncertainty. This paper lays 
the conceptual foundations for these goals, with more detailed developments to 
be found in later work. We call this numerical way of implementing geometric 
theory in statistics computational information geometry. No confusion should 
arise from the fact that the same name is given to a cognate, but distinct, topic 
in machine learning: see for example [31]. 

In practice, a single statistical problem can involve more than one of the above 
geometries - potentially all three - this plurality being handled naturally in our 
unifying framework. Indeed, we use a varied set of substantive running examples 
to illustrate theoretical and practical aspects of the development. Examples 1 
and 4 (Section 1.1) are chosen to illustrate computational information geometric 
issues in mixture models. Example 2 shows issues in full and curved exponential 
families, while Example 3 looks at the geometry of logistic regression. To aid with 
visualisation additional low dimensional multinomial models are also introduced 
in the body of the paper. 

The key idea of this paper is to represent statistical models - sample spaces, 
together with probability distributions on them - and associated inference prob- 
lems, inside adequately large but finite dimensional spaces. In these embedding 
spaces the building blocks of the three geometries described above can be nu- 
merically computed explicitly and the results used for algorithm development. 
In §1.2 and §6 we reflect on the generality of working in this finite, discrete 
framework even with models for continuous random variables. 
Accordingly, after a possible initial discretisation, the space of all distributions 
for the random variable of interest can be identified with the simplex, 

$$\Delta^k := \left\{ \pi = (\pi_0, \pi_1, \ldots, \pi_k)^T : \pi_i \ge 0, \ \sum_{i=0}^{k} \pi_i = 1 \right\}, \qquad (1.1)$$

together with a unique label for each vertex, representing the random variable. 
Modulo discretisation, this structure therefore acts as a universal model. Clearly, 
the multinomial family on k + 1 categories can be identified with the relative 
interior of this space, int($\Delta^k$), while the extended family allows the possibility 
of distributions with different support sets. 

The starting point for much of statistical inference is a working model for 
observed data comprising a set of distributions on a sample space. A working 
model $\mathcal{M}$ can be represented by a subset of $\Delta^k$ and may be specified by an 
explicit parameterisation, such as Example 2, or as the solution of a set of equations, 
such as Example 4. Computational information geometry explicitly uses 
the information geometry of $\Delta^k$ to numerically compute statistically important 
features of $\mathcal{M}$. These features include properties of the likelihood, which can 
be nontrivial in many of the examples considered here, the adequacy of first 
order asymptotic methods - notably, via higher order asymptotic expansions - 
curvature based dimension reduction and inference in mixture models. 

1.1. Examples 

For ease of reference the main examples considered in this paper are briefly 
described here, together with the main points which they illustrate. 

Example 1. Mixture of binomial distributions This example comes from 
[22] where the authors state that 'simple one-parameter binomial and Poisson 
models generally provide poor fits to this type of binary data', and therefore it 
is of interest to look in a 'neighbourhood' of these models. The extended multi- 
nomial space is a natural place to define such a 'neighbourhood' and a new 
computational algorithm defined in §5 is used for inference. 

Example 2. Censored exponential This example looks at a continuous response 
variable - a censored survival time. Section 1.2 considers applying the 
results of computational information geometry to models for continuous random 
variables while Theorems 4.1 and 4.2 show how this can be done with negligible 
loss for inference. In this case, results on curvature based dimension 
reduction are also illustrated. 

Example 3. Logistic regression This is a full exponential family that lies 
in a very high dimensional simplex when considered as a model for the joint 
distribution of N binary response variates. In this example, both the existence 
of the maximum likelihood estimate (see [34] and [14]) and higher order approx- 
imations to sampling distributions are considered. 
Example 4. Tripod model The tripod example is discussed in [35] and [36]. 
The directed graph is shown in Fig. 1, where there are binary variables $X_i$, 
i = 1, 2, 3, on each of the terminal nodes, these being assumed independent given 
the binary variable at the internal node H. In the model, it is assumed H is hidden 
(i.e. not observed) so the model is a mixture of members of an exponential 
family. Despite the model's apparent simplicity, the mixture structure can generate 
multiple modes in the likelihood, illustrating difficult identification issues. 




Fig 1. Graph for Tripod model 



1.2. Discretisation 

The approach taken in this paper is inherently discrete and finite. Sometimes 
this is with no loss at all, since the models used involve only such random variables. 
In general, suitable finite partitions of the sample space can be used, for which 
an appropriate theory is developed. While this is clearly not the most general 
case mathematically speaking (an equivalence relation being thereby induced), 
it does provide an excellent foundation on which to construct a computational 
theory. Furthermore, since real world measurements can only be made to a fixed 
precision all models can - arguably, should - be thought of as fundamentally 
categorical. The relevant question for a computational theory is then: what is 
the effect on the inferential objects of interest of a particular selection of such 
categories? This is looked at in Theorems 4.1 and 4.2. 

Example 2 (continued). Here the data, taken from [17], while being treated 
as continuous, is only recorded as an integer number of days. Thus as far as any 
statistical analysis that can be carried out is concerned there is literally zero 
loss in treating it as sparse categorical. For Figs. 9 and 10 a further level of 
coarseness was added by selecting bins of size 4 days. As can be seen from the 
likelihood plot, Fig. 9, there is effectively no inferential loss in such a choice. 
1.3. Structure of paper 

The paper is structured as follows. Section 2 looks at the information geometry 
of $\Delta^k$. It shows the geometry to be both explicit and tractable. In particular, the 
way that global geometry determines the relationship between the natural and 
mean parameters of exponential families is discussed in §2.1. The Fisher infor- 
mation is also key and results on its spectrum are found in §2.2, while the shape 
of the likelihood function is discussed in §2.3. Section 3 looks at the importance 
of understanding the closure of $\Delta^k$, and of exponential families embedded in 
$\Delta^k$, where we consider the computation of limit points and the corresponding 
behaviour of maximum likelihood estimates. Direct applications of the numeri- 
cal approach are discussed in Section 4. Issues considered include: using higher 
order asymptotic methods, such as Edgeworth and saddlepoint expansions and, 
also, dimension reduction and information loss. Section 5 looks at the way that 
the mixture geometry of [25] fits naturally into the computational information 
geometry framework. In this section, Examples 1 and 4 show the utility of the 
methods. Again the issue of dimension, this time in the −1-geometry, comes to 
the fore. Throughout, proofs and more technical discussions are found in the 
appendices. 

2. Geometry of extended multinomial distribution 

The key idea behind computational information geometry is that models can be 
embedded in a computationally tractable space with little loss to the inferential 
problem of interest. Information geometry is constructed from two different 
affine geometries related in a non-linear way via duality and the Fisher information, 
see [1] or [21]. In the full exponential family context, one affine structure 
(the so-called +1 structure) is defined by the natural parameterization, the 
second (the −1 structure) by the mean parameterization. The closure of exponential 
families has been studied by [4], [8], [23] and [33] in the finite dimensional 
case and by [11] in the infinite dimensional case. One important difference in 
the approach taken here is that limits of families of distributions, rather than 
pointwise limits, are central. 

This paper constructs a theory of information geometry following that introduced 
by [1] via the affine space construction introduced by [30] and extended 
by [26]. Since this paper concentrates on categorical random variables, the following 
definitions are appropriate. Consider a finite set of disjoint categories or 
bins $\mathcal{B} = \{B_i\}_{i \in A}$. Any distribution over this finite set of categories is defined 
by a set $\{\pi_i\}_{i \in A}$ which defines the corresponding probabilities. 

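Definition 2.1. The −1-affine space structure over distributions on $\mathcal{B} := \{B_i\}_{i \in A}$ 
is $(X_{\mathrm{mix}}, V_{\mathrm{mix}}, +)$ where 

$$X_{\mathrm{mix}} = \left\{ \{x_i\}_{i \in A} \ \Big| \ \sum_{i \in A} x_i = 1 \right\}, \qquad V_{\mathrm{mix}} = \left\{ \{v_i\}_{i \in A} \ \Big| \ \sum_{i \in A} v_i = 0 \right\},$$

and the addition operator + is the usual addition of sequences. 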
In Definition 2.1 the space of (discretised) distributions is a −1-convex subspace 
of the affine space $(X_{\mathrm{mix}}, V_{\mathrm{mix}}, +)$. A similar affine structure for the +1-geometry, 
once the support has been fixed, can be derived from the definitions 
in [30]. 

The extended multinomial family, over k + 1 categories, characterized by the 
closed simplex of probabilities $\Delta^k$ defined in (1.1) will be the computationally 
tractable space. For these families, the ±1 dual affine geometries are explicit, the 
only 'hard' computational tasks being the non-linear mapping between convex 
subsets of affine spaces and the computation of the mixed parameterization, as 
defined in [5]. Furthermore, the Fisher information and its inverse are explicit 
and, perhaps more relevantly due to its potentially high order (the dimension 
of the simplex) and non-constant rank, there are good ways of understanding 
and bounding its spectrum, as shown in §2.2. 

It is important to clarify why the closed extended multinomial distribution 
is used. First, in many examples the data is sparse in the sense that the sample 
size is much smaller than k + 1, the number of categories, so that the like- 
lihood, both in the multinomial and sometimes in the embedded models, is 
typically maximized on the boundary. Second, it will be shown that the global 
shape of the likelihood function is determined by boundary behaviour. Third, 
first order asymptotic approximations are rarely uniform across $\Delta^k$ and the 
higher order asymptotic expansions of computational information geometry can 
indicate when the boundary is inferentially relevant. Finally, the link between 
information geometry and Lindsay's mixture geometries is defined by using the 
boundary of $\Delta^k$. 

The probability simplex, and sub-models embedded in it, have been exten- 
sively studied in the geometric approach to graphical models, see [34], [14]. In 
this literature, other sampling schemes than the multinomial are also studied, 
boundary issues again being shown to have great importance. One of the im- 
portant new features here is the application of the full information geometry 
machinery to these models. 

2.1. Geometry of extended trinomial distribution 

To illustrate the information geometry of the extended multinomial distribution, 
the trinomial case is now described explicitly. The general case follows 
by obvious extensions, as shown later (Section 3.1), unless the dimension is so 
large that numerically evaluating sums becomes impractical, see [15]. 

Example 5. An explicit example of the information geometry of the extended 
trinomial model is shown in Fig. 2. The closed simplex in panel (a) represents the 
set of multinomial distributions with bin probabilities $(\pi_0, \pi_1, \pi_2)$ where $\pi_i \ge 0$. 

In this example, the vector $b^T = (1, 2, 3)$ was chosen, and the parallel lines in 
panel (a) are level sets of the mean of $b^T X$, where X is the trinomial random 
variable. In the terminology of classical information geometry, these are −1-geodesics, 
and it is immediate that they extend to the boundary in a very natural 
way. These lines lie in the (tangent) direction a which satisfies $\sum_{k=0}^{2} a_k = 0$ 
and $\sum_{k=0}^{2} a_k b_k = 0$. These lines are also shown in panel (b), but now in the +1 
(or natural) parameterization and so are non-linear. Note that the single line, 
labelled by the mean value equalling 2, corresponds to the −1-geodesic passing 
through the vertex at (1,0) in panel (a). 



[Figure 2, four panels: (a) −1-geodesics in −1-simplex; (b) −1-geodesics in +1-simplex; (c) +1-geodesics in −1-simplex; (d) +1-geodesics in +1-simplex] 

Fig 2. The information geometry of the extended trinomial model 

Panel (d) shows the relative interior of the extended trinomial in the natural 
affine parameterization. The straight lines represent one dimensional full 
exponential families with probabilities of the form 

$$\left( \frac{\pi_0 \exp(\theta b_0)}{\sum_{k=0}^{2} \pi_k \exp(\theta b_k)}, \ \frac{\pi_1 \exp(\theta b_1)}{\sum_{k=0}^{2} \pi_k \exp(\theta b_k)}, \ \frac{\pi_2 \exp(\theta b_2)}{\sum_{k=0}^{2} \pi_k \exp(\theta b_k)} \right),$$

each $\pi_k > 0$. These are +1-geodesics in the direction b through the base-point 
$(\pi_0, \pi_1, \pi_2)$ and, by the strict positivity of the exponential function, their images 
in panel (c) lie strictly in the interior of the simplex. It is a standard result 
that these +1-parallel lines are everywhere orthogonal, with respect to the metric 
defined by the Fisher information matrix, to the −1-parallel lines shown in panels 
(a) and (b). Each of these parallel lines can be found by moving the base-point 
by 

$$(\pi_0, \pi_1, \pi_2) \mapsto (\pi_0, \pi_1, \pi_2) + \sigma (a_0, a_1, a_2),$$

$\sum_{k=0}^{2} a_k = 0$, where $\sigma$ is restricted so that all components remain non-negative, 
[26]. 

The key step in understanding the simplicial nature of the +1-geometry is to 
see how the limits of the +1-parallel lines are connected to the boundary of the 
simplex. This is made clear in panel (c), where the +1-geodesics are plotted in 
the −1-affine parameters as curves. As $\sigma$ changes the limits of the curves clearly 
exist and lie on the boundary of the simplex. The closure of the +1-representation 
multinomial is defined to make these continuous limits defined "at infinity" in 
the +1-parameters and is shown schematically as the dotted triangle in panel (b). 
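The behaviour described in this subsection can be checked numerically. The following is a minimal sketch only, not code from the paper: the function name plus_one_geodesic and the base point (0.2, 0.5, 0.3) are hypothetical choices made for illustration, while b = (1, 2, 3) is the direction of Example 5 and a = (1, −2, 1) is one vector satisfying the two constraints on the −1 direction. It illustrates that moving along a preserves the mean of $b^T X$, and that the +1-geodesic stays strictly inside the simplex while approaching a vertex as the natural parameter grows in magnitude.

```python
import numpy as np

def plus_one_geodesic(pi, b, theta):
    """Point at parameter theta on the +1-geodesic through an interior point pi in direction b."""
    pi = np.asarray(pi, dtype=float)
    expo = theta * np.asarray(b, dtype=float)
    w = pi * np.exp(expo - expo.max())   # subtract the maximum exponent for numerical stability
    return w / w.sum()

b = np.array([1.0, 2.0, 3.0])        # direction b used in Example 5
base = np.array([0.2, 0.5, 0.3])     # hypothetical base point (assumption, not from the paper)
a = np.array([1.0, -2.0, 1.0])       # satisfies sum(a) = 0 and sum(a * b) = 0

# Moving along a leaves the mean of b^T X unchanged (a level set, i.e. a -1-geodesic).
shifted = base + 0.1 * a
print(b @ base, b @ shifted)

# Large |theta| pushes the +1-geodesic towards a vertex of the simplex.
for theta in (-40.0, 0.0, 40.0):
    print(theta, plus_one_geodesic(base, b, theta))
```

The vertex approached as theta tends to plus or minus infinity is the one carrying the largest or smallest component of b respectively, which is the rank-structure idea used later for computing limits.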

2.2. Spectrum of Fisher Information 

The material above looks explicitly at the ±1-affine geometries of [1] while 
this section concentrates on the third part of Amari's structure, i.e. the Fisher 
information or 0-geometry. In any multinomial model, the Fisher information 
matrix and its inverse are explicit. Indeed, the 0-geodesics and the corresponding 
geodesic distance are also explicit, see [1] or [21]. However, since the simplex 
glues together multinomial structures with different supports, and the compu- 
tational theory is in high dimensions, it is a fact that the Fisher information 
matrix can be arbitrarily close to being singular. It is therefore of central inter- 
est that the spectral decomposition of the Fisher information itself has a very 
nice structure, as shown in this section. 

Example 6. Consider a multinomial distribution based on 81 categories of 
equal width on [−5, 5], where the probability associated to a bin is proportional 
to that of the standard normal distribution for that bin. The Fisher information 
for this model is an 80 × 80 matrix whose spectrum is shown in Fig. 3. By 
inspection it can be seen that there are exponentially small eigenvalues, so that 
while the matrix is positive definite it is also arbitrarily close to being singular. 
Furthermore, it can be seen that the spectrum has the shape of a half-normal 
density function and that the eigenvalues seem to come in pairs. These facts are 
direct consequences of the following results. 



[Figure 3: eigenvalues plotted against rank] 

Fig 3. Spectrum of the Fisher information matrix of a discretised normal distribution 



With $\pi_{(0)}$ denoting the vector of all bin probabilities except $\pi_0$, the Fisher 
information matrix for the +1 parameters, written as a function of the probabilities, 
is the sample size times 

$$J(\pi) := \mathrm{diag}(\pi_{(0)}) - \pi_{(0)} \pi_{(0)}^T,$$

whose explicit spectral decomposition, given in all cases in Appendix 1, is an 
example of interlacing eigenvalue results (see for example [18], Chapter 4). In 
particular, suppose $\{\pi_i\}_{i=1}^{k}$ comprises $g > 1$ distinct values $\lambda_1 > \cdots > \lambda_g > 0$, $\lambda_i$ 
occurring $m_i$ times, so that $\sum_{i=1}^{g} m_i = k$. Then the spectrum of $J(\pi)$ comprises 
g simple eigenvalues $\{\tilde{\lambda}_i\}_{i=1}^{g}$ satisfying 

$$\lambda_1 > \tilde{\lambda}_1 > \cdots > \lambda_g > \tilde{\lambda}_g \ge 0, \qquad (2.1)$$

together, if $g < k$, with $\{\lambda_i : m_i > 1\}$, each such $\lambda_i$ having multiplicity $m_i - 1$. 
Further, $\tilde{\lambda}_g > 0 \iff \pi_0 > 0$, while each $\tilde{\lambda}_i$ ($i < g$) is typically (much) closer to $\lambda_i$ 
than to $\lambda_{i+1}$, making it a near replicate of $\lambda_i$. 

In this way, the Fisher spectrum mimics key features of the bin probabilities. 
Of central importance, one or more eigenvalues are exponentially small if and 
only if the same is true of the bin probabilities, the Fisher information matrix 
being singular if and only if one or more of the $\{\pi_i\}_{i=0}^{k}$ vanishes. Again, typically, 
two or more eigenvalues will be close when two or more corresponding bin 
probabilities are. We see this in Example 6 where, by symmetry of the distribution, 
the bin probabilities are paired, so that $m_i = 2$. The (decreasingly) ordered 
plot of the eigenvalues, Figure 3, then resembles two copies of the half-density 
formed by folding at the mode. These dominant features are robust to which 
bin we omit in forming $\pi_{(0)}$ and to asymmetric placing of the bins. 
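The spectral claims of this section can be explored with a few lines of numerical linear algebra. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names discretised_normal and fisher_information are hypothetical, the omitted bin is taken to be the first one, and the normal distribution function is evaluated via math.erf. It builds the 80 × 80 matrix of Example 6 (per unit sample size) and compares its sorted eigenvalues with the sorted bin probabilities, which is the interlacing pattern for a rank-one modification of a diagonal matrix.

```python
import math
import numpy as np

def discretised_normal(n_bins=81, lo=-5.0, hi=5.0):
    """Bin probabilities proportional to standard normal mass on equal-width bins."""
    edges = np.linspace(lo, hi, n_bins + 1)
    cdf = np.array([0.5 * (1.0 + math.erf(x / math.sqrt(2.0))) for x in edges])
    p = np.diff(cdf)
    return p / p.sum()

def fisher_information(pi):
    """diag(pi_(0)) - pi_(0) pi_(0)^T, omitting the first bin (an assumption of this sketch)."""
    rest = np.asarray(pi, dtype=float)[1:]
    return np.diag(rest) - np.outer(rest, rest)

pi = discretised_normal()
J = fisher_information(pi)

eig = np.sort(np.linalg.eigvalsh(J))[::-1]   # eigenvalues, largest first
probs = np.sort(pi[1:])[::-1]                # retained bin probabilities, largest first

print(J.shape)
print("largest eigenvalue:", eig[0], "smallest eigenvalue:", eig[-1])
```

Plotting eig against its index should give a picture comparable to Fig. 3; the very small trailing eigenvalues reflect the very small tail bin probabilities.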

2.3. Likelihood in the simplex 

Potentially high dimensional simplicial structures being the natural spaces in 
which to base computational information geometry, a primary question is to look 
at the way that the likelihood, or log-likelihood, behaves in them. First note two 
important issues: in typical applications the sample size will be much smaller 
than the dimension of the simplex, while the simplex contains sub-simplexes 
with varying support. These two statements mean that our standard intuition 
about the shape of the log-likelihood function will not hold. In particular, the 
standard $\chi^2$-approximation to the distribution of the deviance does not hold. 

It will be convenient to call the face of the simplex spanned by the vertices 
(bins) having strictly positive counts the observed face, and the face spanned by 
the complement of this set the unobserved face. In the −1-representation, the 
log-likelihood is strictly concave on the observed face, strictly decreasing in the 
normal direction from it to the unobserved face and, otherwise, constant. This 
is illustrated - a schematic representation of the quadrinomial case when there 
are two zeros in the vector of counts - in Figure 4, the −1-flat subspaces being 
formalised in Theorem 2.1. 
[Figure 4: schematic of the simplex, with the unobserved face labelled] 

Fig 4. The shape of the likelihood in a simplex 

The following theorem characterises the shape of the log-likelihood function 
in the −1-representation on the simplex. This function is concave, but not 
strictly concave, so the theorem characterises where the lack of strict concavity 
comes from. Being given by the function $\sum_{i \in \mathcal{P}} n_i \log \pi_i$, with the constraints 
$\sum_{i \in \mathcal{P} \cup \mathcal{Z}} \pi_i = 1$ and $\pi_i \ge 0$, it is immediate that the log-likelihood is constant on 
subsets defined by fixing $\pi_i$, $i \in \mathcal{P}$, and varying $\pi_i$, $i \in \mathcal{Z}$. The decomposition 
presented in part (b) of the theorem shows that these subsets are, in fact, contained 
in −1-affine subspaces. 

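Theorem 2.1. Let the observed counts be $\{n_i\}_{i=0}^{k}$ and define two subsets of 
the index set $\{0, \ldots, k\}$ by $\mathcal{P} = \{i \mid n_i > 0\}$ and $\mathcal{Z} = \{i \mid n_i = 0\}$. Let $V_{\mathrm{mix}} = 
\{(v_0, \ldots, v_k) \mid \sum_i v_i = 0\}$, and further define the set $V^0 \subset V_{\mathrm{mix}}$ by $\{v \in V_{\mathrm{mix}} \mid v_i = 
0 \ \forall i \in \mathcal{P}\}$. 

(a) The set $V^0$ is a linear subspace of $V_{\mathrm{mix}}$. The log-likelihood is constant on 
−1-affine subspaces of the form $\pi + V^0$. 

(b) Select $k^* \in \mathcal{Z}$ and consider the vector subspace of $V_{\mathrm{mix}}$ defined by 

$$V^{k^*} := \{v \in V_{\mathrm{mix}} \mid v_i = 0 \text{ if } i \in \mathcal{Z} \setminus \{k^*\}\}.$$

Then $V_{\mathrm{mix}}$ can be decomposed as a direct sum of vector spaces $V_{\mathrm{mix}} = V^0 \oplus V^{k^*}$. 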
Proof. See Appendix. □ 
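Part (a) of Theorem 2.1 is easy to see numerically. The following sketch is illustrative only: the counts (3, 2, 0, 0) and the probability vector (0.4, 0.3, 0.2, 0.1) are hypothetical values chosen to mimic the quadrinomial case with two zero counts, and log_likelihood is a name introduced here. Moving within the unobserved face leaves the log-likelihood unchanged, whereas moving mass from the observed to the unobserved face decreases it.

```python
import numpy as np

def log_likelihood(counts, pi):
    """Multinomial log-likelihood (up to a constant): sum of n_i log pi_i over observed bins."""
    counts = np.asarray(counts, dtype=float)
    pi = np.asarray(pi, dtype=float)
    observed = counts > 0
    return float(np.sum(counts[observed] * np.log(pi[observed])))

counts = np.array([3, 2, 0, 0])            # hypothetical counts with two zeros
pi = np.array([0.4, 0.3, 0.2, 0.1])        # hypothetical point in the simplex

v0 = np.array([0.0, 0.0, 1.0, -1.0])       # sums to zero and vanishes on observed bins
within_unobserved = pi + 0.05 * v0

towards_unobserved = pi + 0.1 * np.array([-0.5, -0.5, 0.5, 0.5])

print(log_likelihood(counts, pi))
print(log_likelihood(counts, within_unobserved))    # same value as the line above
print(log_likelihood(counts, towards_unobserved))   # strictly smaller
```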

3. Closure of exponential families 

This section shows how the closure of exponential families plays a role in the 
computational geometry. In §3.1 the discussion of §2.1 is formalised and con- 
nected to the information geometric concept of duality. Furthermore, in §3.2 
Example 3 is used to illustrate the fact that the way that the boundaries of the 
high dimensional simplex are attached to the model is of great importance for 
the behaviour of the likelihood and for the distribution of important inferential 
statistics. 
3.1. Duality 

One of the key aspects of information geometry is the relationship between 
the +1, −1 and Fisher metric or 0-geometric structures via the concept called 
duality. Following [1], when the underlying geometric object is a manifold the 
relationship between the +1 and −1 connections, denoted by $\nabla^{+1}$ and $\nabla^{-1}$, 
and the Fisher information is captured in the duality relationship which can be 
written in terms of the inner product at $\theta$, $\langle \cdot, \cdot \rangle_\theta$, and any vector fields X, Y, Z 
via the equation 

$$X \langle Y, Z \rangle = \langle \nabla^{+1}_X Y, Z \rangle + \langle Y, \nabla^{-1}_X Z \rangle. \qquad (3.1)$$

One consequence of this relationship is the existence on exponential families of 
a so-called mixed parameterization of the form $(\theta, \mu)$, where $\theta$ is +1-affine and 
$\mu$ is −1-affine, their level sets being Fisher orthogonal across the manifold: see 
[5]. 

The following definition gives a useful computational tool for understanding 
the limiting behaviour of exponential subfamilies in $\Delta^k$, and gives a generalisation 
of the trinomial model shown in Fig. 2. 

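Definition 3.1. Let $\pi^0 = (\pi^0_0, \ldots, \pi^0_k)$ be a probability vector, $a_1, \ldots, a_d$ be a 
set of vectors in $\mathbb{R}^{k+1}$, such that 

$$1_{k+1}, a_1, \ldots, a_d$$

are linearly independent, and $b_1, \ldots, b_{k-d}$ be a set of linearly independent vectors 
in $V_{\mathrm{mix}}$ such that $a_i^T b_j = 0$ for $i = 1, \ldots, d$ and $j = 1, \ldots, k - d$. Furthermore, 
define 

$$P_{\pi^0} := \left\{ (\lambda, \sigma) : (p_{\pi^0}(\lambda, \sigma))_h \ge 0 \text{ for all } h = 0, \ldots, k \right\},$$

in which $\lambda \in \mathbb{R}^d$, $\sigma \in \mathbb{R}^{k-d}$ and 

$$(p_{\pi^0}(\lambda, \sigma))_h := \frac{\left( \pi^0_h + \sum_{j=1}^{k-d} (\sigma_j b_j)_h \right) \exp\left\{ \sum_{i=1}^{d} (\lambda_i a_i)_h \right\}}{\sum_{h^*=0}^{k} \left\{ \left( \pi^0_{h^*} + \sum_{j=1}^{k-d} (\sigma_j b_j)_{h^*} \right) \exp\left\{ \sum_{i=1}^{d} (\lambda_i a_i)_{h^*} \right\} \right\}}, \qquad (3.2)$$

where 

$$\pi^0_h + \sum_{j=1}^{k-d} (\sigma_j b_j)_h \ge 0. \qquad (3.3)$$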



Note that for fixed $\sigma = \sigma^0$ the image of $p_{\pi^0}(\cdot, \sigma^0)$ is a d-dimensional exponential 
family. As $\sigma^0$ changes these exponential families are +1-parallel. However, 
for fixed $\lambda = \lambda^0$, the image of $p_{\pi^0}(\lambda^0, \cdot)$ is not in general −1-affine, but is for the 
special case when $\lambda^0 = 0$. Thus this construction, while having the advantage 
of being explicit, is not as strong as a true mixed parameterisation. However, 
the function defined in Definition 3.1 is a useful tool in understanding the limiting 
properties of exponential families within the extended multinomial model. 
Consider the set of possible values of $\sigma$. By condition (3.3) it follows that the 
domain of $\sigma$ - for given $\pi^0$ - is a polytope. As $\sigma$ converges to the boundary 
of this polytope the corresponding exponential family converges to an extended 
exponential family defined on the boundary of $\Delta^k$ determined by the corresponding 
zeros in the probability vector. This construction generalises the plots 
in Fig. 2 (c) and (d). Notice also that it allows the definition of the limits of 
families which complements the pointwise limits defined in [8] and [11]. 
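A small numerical sketch may help fix ideas about the map of Definition 3.1. It is an illustration under assumptions, not the authors' code: the function name p_family, the trinomial base point (0.25, 0.5, 0.25) and the vectors a = (1, 2, 3), b = (1, −2, 1) are hypothetical choices that satisfy the stated conditions (b sums to zero and is orthogonal to a). With these choices condition (3.3) restricts $\sigma$ to the interval [−0.25, 0.25], and at the endpoint $\sigma = 0.25$ the family is supported on the first and third bins only.

```python
import numpy as np

def p_family(pi0, A, B, lam, sigma):
    """Evaluate (3.2): shift pi0 by sum_j sigma_j b_j, then tilt by exp(sum_i lam_i a_i).

    A has the vectors a_i as rows, B has the vectors b_j as rows.
    """
    pi0 = np.asarray(pi0, dtype=float)
    shifted = pi0 + np.asarray(sigma, dtype=float) @ np.asarray(B, dtype=float)
    if np.any(shifted < 0):
        raise ValueError("sigma violates condition (3.3)")
    expo = np.asarray(lam, dtype=float) @ np.asarray(A, dtype=float)
    w = shifted * np.exp(expo - expo.max())    # stabilised exponential tilt
    return w / w.sum()

pi0 = np.array([0.25, 0.5, 0.25])     # hypothetical base point
A = np.array([[1.0, 2.0, 3.0]])       # d = 1
B = np.array([[1.0, -2.0, 1.0]])      # k - d = 1

# For lambda = 0 the map is affine in sigma.
print(p_family(pi0, A, B, [0.0], [0.1]))

# On the boundary of the polytope the whole family lives on a face of the simplex.
for lam in (-2.0, 0.0, 2.0):
    print(lam, p_family(pi0, A, B, [lam], [0.25]))
```

Using dyadic values for the base point keeps the boundary case exact in floating point; with other values a small tolerance in the check of (3.3) might be needed.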

3.2. Computing limits in exponential families 



[Figure 5: envelope of linear functions, plotted against $\theta$] 

Fig 5. The envelope of a set of linear functions. Functions: dashed lines, envelope: solid lines 

Example 7. In order to visualise the geometry of the problem of computing 
limits in exponential families consider a low dimensional example. Define 
a two dimensional full exponential family by the vectors $v_1 = (1, 2, 3, 4)$, $v_2 = 
(1, 4, 9, -1)$ and the uniform distribution base point, embedded in the three dimensional 
simplex. The 2-dimensional family is defined by the +1-affine space 
through (0.25, 0.25, 0.25, 0.25) spanned by the space of vectors of the form 

$$\alpha (1, 2, 3, 4) + \beta (1, 4, 9, -1) = (\alpha + \beta, \ 2\alpha + 4\beta, \ 3\alpha + 9\beta, \ 4\alpha - \beta).$$

Consider directions from the origin found by writing $\alpha = \theta\beta$ giving, for each 
$\theta$, a one dimensional full exponential family parameterized by $\beta$ in the direction 
$\beta(\theta + 1, 2\theta + 4, 3\theta + 9, 4\theta - 1)$. The aspect of this vector which determines the 
connection to the boundary is the rank structure of its elements. For example, 
suppose the first component was the maximum and the last the minimum, then 
as $\beta \to \pm\infty$ this one dimensional family will be connected to the first and 
fourth vertices of the embedding simplex, respectively. Note that changing 
the value of $\theta$ changes the rank structure, as illustrated in Fig. 5. In this plot, 
the four linear functions of $\theta$ are plotted (dashed lines) and the impact of 
rank structure is determined by the upper and lower envelopes (solid lines). From 
this analysis of the envelopes of a set of linear functions it can be seen that the 
function $2\theta + 4$ is redundant. The consequence of this is shown in Fig. 6 which 
shows the result of direct computation in the two dimensional family. It is clear 
that, indeed, only three of the four vertices of the ambient 3-simplex have been 
connected by the model. 
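The envelope argument of Example 7 can be reproduced with a direct grid evaluation. This is a minimal sketch under assumptions: the function name envelope_vertices and the grid of $\theta$ values on [−15, 15] are choices made here for illustration, and a grid search is a simple stand-in for the dual, extremal-point formulation mentioned in the text. Vertices are indexed from 0, so index 1 corresponds to the function $2\theta + 4$.

```python
import numpy as np

def envelope_vertices(slopes, intercepts, thetas):
    """Indices of the lines slope * theta + intercept attaining the upper / lower envelope on a grid."""
    values = np.outer(thetas, slopes) + intercepts    # one row per value of theta
    upper = set(np.argmax(values, axis=1).tolist())
    lower = set(np.argmin(values, axis=1).tolist())
    return upper, lower

v1 = np.array([1.0, 2.0, 3.0, 4.0])     # coefficients of theta in each component
v2 = np.array([1.0, 4.0, 9.0, -1.0])    # constant term in each component
thetas = np.linspace(-15.0, 15.0, 3001)

upper, lower = envelope_vertices(v1, v2, thetas)
print(sorted(upper | lower))   # [0, 2, 3]
```

The second vertex never appears, in agreement with the statement that only three of the four vertices are connected to the model.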

In general, the problem of finding the limit points in full exponential families 
inside simplex models is a problem of finding redundant linear constraints. As 
shown in [12], this can be converted, via duality, into the problem of finding 
extremal points in a finite dimensional affine space. 



Fig 6. Attaching a two dimensional example to the boundary of the simplex 

Example 3 (continued). Consider an $N \times D$ design matrix X with N samples 
and a binary response $t \in \{0,1\}^N$. Let $s(x) = \log\left(\frac{x}{1-x}\right)$ so that $s^{-1}(x) = 
\frac{\exp(x)}{1+\exp(x)}$, the logistic regression model being given by 

$$P(T_i = 1) = s^{-1}(\beta^T x_i)$$

where $x_i^T$ is the i-th row of X. This is a full exponential family that lies in 
the $(2^N - 1)$-simplex when considered as a model for the joint distribution of the N 
binary response variates. A design matrix X defines a D-dimensional +1-affine 
subset and changing the explanatory variates changes the direction of this low 
dimensional space inside the space of joint distributions. 

Consider response data (0, 1, 0, 1, 0, 1, 1), the explanatory variables being 
$x_0 = (1, 1, 1, 1, 1, 1, 1)$ and $x_1 = (1, 2, 3, 4, 5, 6, 7)$. For convenience, in the space 
of all joint distributions, label the bin associated with the sequence $\{t_i\}$ with 
the binary number which that sequence represents 

$$\sum_{i=0}^{N-1} \cdots \qquad (3.4)$$

This logistic model is a two-dimensional exponential family which passes through 
the point corresponding to the uniform distribution of the $2^N$ simplex and lies 
in the directions 

$$v_0 = \left( \sum_{i=0}^{N-1} t_{ij} x_{0i} \right)_{j=1}^{2^N}, \qquad v_1 = \left( \sum_{i=0}^{N-1} t_{ij} x_{1i} \right)_{j=1}^{2^N},$$

where $t_{ij}$ is the binary representation of vertex j. 




[Figure 7: lines plotted against theta] 

Fig 7. Envelopes of lines 

As in Example 7 consider the way that this two-dimensional exponential family 
is attached to the boundary using the envelope method. There are $2^N$ possible 
lines to consider and these are shown in Fig. 7. Those lines whose duals are 
extremal points are plotted in red and it can clearly be seen that the upper and 
lower envelopes have been found. The corresponding vertices which the full exponential 
family reaches are given by vectors of the form z with the structure 
either $z_i = 0$ for $i = 1, \ldots, h$ and 1 for $i = h + 1, \ldots, N$, or $z_i = 1$ for $i = 1, \ldots, h$ and 
0 for $i = h + 1, \ldots, N$. 

We can see how this global geometry affects the inference. One immediate 
issue is that if the observed data is a sequence which is one of the vertices 
listed above then the corresponding MLE will also lie on the boundary. Thus, 
for example, if the observed data is (0, 1, 0, 1, 0, 1, 1) there is a 'regular' turning 
point in $\beta$-space. However if, instead, the data is (1, 1, 0, 0, 0, 0, 0) the MLE does 
indeed go to infinity and has its maximum at the correct vertex. This result for 
N = 7 in fact generalizes, when the explanatory variable is linear, to any N . 
The corresponding vertices which the full exponential family reaches are again 
given by vectors of the form z with one of the two structures identified above. 
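
The vertex structure described above can be checked by direct enumeration. The following Python sketch assumes a strictly increasing explanatory variable, as with $x_1$ here, and uses a hypothetical helper `is_step_sequence` (not part of the paper) which tests whether a response sequence changes value at most once. Counting the two constant sequences as step sequences is an assumption of the sketch.

```python
from itertools import product

def is_step_sequence(z):
    """Return True if the binary sequence z changes value at most once,
    i.e. has the form 0...01...1 or 1...10...0 (constant sequences included)."""
    changes = sum(1 for a, b in zip(z, z[1:]) if a != b)
    return changes <= 1

N = 7
# Enumerate all 2**N response sequences and keep those of step form
step_vertices = [z for z in product((0, 1), repeat=N) if is_step_sequence(z)]

print(len(step_vertices), "step sequences out of", 2 ** N)
print(is_step_sequence((0, 1, 0, 1, 0, 1, 1)))  # data with a regular turning point
print(is_step_sequence((1, 1, 0, 0, 0, 0, 0)))  # data whose MLE lies on the boundary
```

Only the second of the two data sets discussed in the text has the step structure, which is consistent with its MLE lying on the boundary.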

4. The tools of information geometry 

In general, working in a simplex, boundary effects mean that standard first 
order asymptotic results can fail. Most standard methods are not uniform across 
the simplex. Therefore one way that the higher order asymptotic methods of 
information geometry have value is that they can be used to validate the region 
of parameter space where the first order method will be accurate. Example 2
has a continuous random variable with compact support and it is used to show 
how discretisation can be used to apply computational information geometry to 
such models. 

4.1. Higher order asymptotics: Edgeworth expansions

One very powerful set of results from classical information geometry derives 
from the fact that geometrically based tensor analysis is well-suited for use in 
multi-dimensional higher order asymptotic analysis, see [6] or [29]. However, 
using this tensorial formulation is not without difficulty for the mainstream 
statistician. Its very efficient, tight notation may perhaps obscure rather than 
enlighten, while the resulting formulae can typically have a very large number 
of terms, making them rather cumbersome to work with explicitly. These obsta- 
cles to implementation are overcome by the computational approach described 
in this paper. The clarity of the tensorial approach is ideal for coding, while 
large numbers of additive terms are, of course, easy to deal with. Two more
fundamental issues, which the global geometric approach of this paper highlights,
concern numerical stability. First, the ability to invert the Fisher information
matrix is vital in most tensorial formulae, and so understanding its spectrum, as
discussed in Section 2.2, is essential. Secondly, numerical underflow and overflow
near boundaries require careful analysis, and so understanding the way that models
are attached to the boundaries of the extended multinomial models is equally
important.
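
The underflow and overflow issue can be illustrated with a small Python sketch. It is not taken from the paper: the function name `log_probs` and the natural parameter values are hypothetical, and the log-sum-exp shift used here is one standard way, among others, of keeping the computation stable near a vertex of the simplex.

```python
import numpy as np

def log_probs(eta):
    """Log-probabilities of the multinomial with natural parameters eta,
    computed with the log-sum-exp shift to avoid overflow and underflow."""
    eta = np.asarray(eta, dtype=float)
    shift = eta.max()
    log_norm = shift + np.log(np.sum(np.exp(eta - shift)))
    return eta - log_norm

# Hypothetical natural parameters for a point very close to a vertex
eta = np.array([0.0, 800.0, -800.0])

lp = log_probs(eta)

# The naive formula overflows in float64, since exp(800) is infinite
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(eta) / np.sum(np.exp(eta))

print("stable log-probabilities:", lp)
print("naive probabilities:     ", naive)
```

Working on the log scale keeps every quantity finite, whereas the naive evaluation returns a non-number for the dominant bin.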

An important aspect of higher order methods is not just their accuracy in 
a given example, but the way that they can be used to validate first order 
methods. In cases like logistic regression first order methods are typically used 
for inference despite the fact that they are not uniformly accurate across the 
parameter space of interest. In the example below the fact that the Edgeworth 
expansion is far from normal acts as a diagnostic for the first order methods. 

Example 3 (continued). Consider Fig. 8 where the parameters of a two dimensional
logistic family are such that the sampling distribution of the sufficient
statistics is considerably far from normal. This is shown by the simulated sample
of black points. The red contours, computed numerically from the Edgeworth
expansion, show a good fit with the simulation, but describe a distribution which is
far from the first order approximation. As holds widely, the Edgeworth expansion
is easy to compute numerically in this example.

4.2. Continuity and compactness

In order to use the high dimensional simplex models with continuous random
variables it is necessary to truncate and discretise the sample space into a finite
number of bins. The following theorems show that the information loss in doing
this is arbitrarily small for a fine enough discretisation and that the key to
understanding the information in general is controlling the conditional moments
in each bin of the random variables of interest, uniformly in the parameters of
the model.

Fig 8. Using the Edgeworth expansion near the boundary of Example 3

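Theorem 4.1. Let $f(x; \theta)$, $\theta \in \Theta$, be a parametric family of density functions
with common support $\mathcal{X} \subset \mathbb{R}^d$, each being continuously differentiable on the
relative interior of $\mathcal{X}$, assumed non-empty. Further, let $\mathcal{X}$ be compact, while

$$\left\{ \left\| \frac{\partial}{\partial x} f(x; \theta) \right\| : x \in \mathcal{X} \right\}$$

is uniformly bounded in $\theta \in \Theta$ by $M$, say.

Then for any $\epsilon > 0$ and for any sample size $N > 0$, there exists a finite,
measurable partition $\{B_k\}_{k=0}^{K(\epsilon, N)}$ of $\mathcal{X}$ such that: for all
$(x_1, \ldots, x_N) \in \mathcal{X}^N$, and for all $(\theta_0, \theta) \in \Theta^2$,

$$\left| \log\left\{ \frac{Lik_d(\theta)}{Lik_d(\theta_0)} \right\} - \log\left\{ \frac{Lik_c(\theta)}{Lik_c(\theta_0)} \right\} \right| < \epsilon, \tag{4.1}$$

where $Lik_d$ and $Lik_c$ are the likelihood functions from the discretised and
continuous distributions respectively.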



Proof. See Appendix. □
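
The content of Theorem 4.1 can be illustrated numerically. The sketch below is not from the paper: it assumes, purely for illustration, an exponential density truncated to the compact interval $[0, T]$, equal-width bins, simulated data and hypothetical parameter values. It compares the log-likelihood ratio of the discretised model with that of the continuous model as the bins are refined.

```python
import numpy as np

def loglik_continuous(x, theta, T):
    """Log-likelihood for the exponential density truncated to [0, T]."""
    return np.sum(np.log(theta) - theta * x) - x.size * np.log(-np.expm1(-theta * T))

def loglik_discrete(x, theta, T, width):
    """Log-likelihood after replacing each x by the bin [a, a + width) containing it."""
    a = np.floor(x / width) * width
    log_bin_mass = -theta * a + np.log(-np.expm1(-theta * width))
    return np.sum(log_bin_mass) - x.size * np.log(-np.expm1(-theta * T))

def ratio_gap(x, theta, theta0, T, width):
    """Absolute difference between discretised and continuous log-likelihood ratios."""
    d = loglik_discrete(x, theta, T, width) - loglik_discrete(x, theta0, T, width)
    c = loglik_continuous(x, theta, T) - loglik_continuous(x, theta0, T)
    return abs(d - c)

rng = np.random.default_rng(0)
T, theta_true, n = 5.0, 1.0, 50          # hypothetical settings
u = rng.uniform(size=n)
x = -np.log1p(u * np.expm1(-theta_true * T)) / theta_true   # inverse-CDF sampling on [0, T)

gaps = {w: ratio_gap(x, 1.0, 2.0, T, w) for w in (0.5, 0.05, 0.005)}
for w, g in gaps.items():
    print(f"bin width {w}: gap in log-likelihood ratio = {g:.5f}")
```

For this family the gap is bounded by $N(\theta + \theta_0)$ times the bin width, so it can be made smaller than any $\epsilon$ by refining the partition, in line with (4.1).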



The following result looks at the case where the family that is discretised is 
itself an exponential family and so the tools of classical information geometry 
can be applied. In general, after discretisation a full exponential family does not 
remain full exponential and there is information loss. However, the following
results show that this loss can be made small enough to be unimportant for 
inference and that all information geometric results on the two families can be 
made arbitrarily close. 

Theorem 4.2. Let $f(x; \theta) = \nu(x) \exp\{\theta^T s(x) - \psi(\theta)\}$, $x \in \mathcal{X}$, $\theta \in \Theta$, be an
exponential family which satisfies the regularity conditions of [1], p. 16. Further,
assume that $s(x)$ is uniformly continuous and $s(\mathcal{X})$ is compact.

Then, for any $\epsilon > 0$, there exists a finite measurable partition $\{B_k\}_{k=0}^{K(\epsilon)}$ of $\mathcal{X}$
such that, for all choices of bin labels $s_k \in s(B_k)$, all terms of Amari's information
geometry for $f(x; \theta)$ can be approximated to $O(\epsilon)$ by the corresponding
terms for the family

$$\left\{ (\pi_k(\theta), s_k) \;\Big|\; \pi_k(\theta) = \int_{B_k} f(x; \theta)\,dx, \; s_k \in s(B_k) \right\}.$$

In particular:

(a) For all $\theta$, and any norm,

$$\| \mu_d(\theta) - \mu_c(\theta) \| = O(\epsilon)$$

where $\mu_d(\theta) = \sum_{k=0}^{K(\epsilon)} s_k \pi_k(\theta)$ and $\mu_c(\theta) = \int_{\mathcal{X}} s(x) f(x; \theta)\,dx$.

(b) The expected Fisher information for $\theta$ of $f(x; \theta)$, $I_c(\theta)$, and the expected
Fisher information for $\{\pi_k(\theta)\}$, $I_d(\theta)$, satisfy

$$\| I_d(\theta) - I_c(\theta) \|_\infty = O(\epsilon^2).$$

(c) The skewness tensors $T_c(\theta)$, see [1], p. 105, of $f(x; \theta)$ and $T_d(\theta)$ for
$\{\pi_k(\theta)\}$ satisfy

$$\| T_d(\theta) - T_c(\theta) \|_\infty = O(\epsilon^3).$$

Proof. See Appendix. □
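
Parts (a) and (b) can be checked numerically in a simple case. The sketch below is an illustration under stated assumptions rather than the authors' computation: it takes the one-parameter family $f(x; \theta) = \exp\{\theta x - \psi(\theta)\}$ on $[0, 1]$, so that $s(x) = x$, uses $K$ equal-width bins with midpoint labels, and a hypothetical value of $\theta$. With bin width $\epsilon = 1/K$ the mean differs by at most $\epsilon/2$ and the loss of Fisher information, which equals the expected conditional variance within bins, is at most $\epsilon^2/4$.

```python
import numpy as np

def continuous_moments(theta):
    """Mean and Fisher information of exp(theta*x - psi(theta)) on [0, 1] (theta != 0)."""
    e = np.exp(theta)
    mu = e / (e - 1.0) - 1.0 / theta
    info = 1.0 / theta**2 - e / (e - 1.0) ** 2
    return mu, info

def discretised_moments(theta, K):
    """Mean (midpoint labels) and Fisher information of the binned family."""
    edges = np.linspace(0.0, 1.0, K + 1)
    a, b = edges[:-1], edges[1:]
    mass = (np.exp(theta * b) - np.exp(theta * a)) / theta   # integral of exp(theta*x) per bin
    pi = mass / mass.sum()
    # integral of x*exp(theta*x) per bin, giving the conditional mean E(X | X in B_k)
    first = ((b / theta - 1.0 / theta**2) * np.exp(theta * b)
             - (a / theta - 1.0 / theta**2) * np.exp(theta * a))
    cond_mean = first / mass
    labels = 0.5 * (a + b)                                   # bin labels s_k
    mu_d = np.sum(labels * pi)
    score = cond_mean - np.sum(cond_mean * pi)               # derivative of log pi_k in theta
    info_d = np.sum(pi * score**2)
    return mu_d, info_d

theta = 1.5                                                  # hypothetical parameter value
mu_c, info_c = continuous_moments(theta)
for K in (10, 100):
    mu_d, info_d = discretised_moments(theta, K)
    print(K, abs(mu_d - mu_c), info_c - info_d)
```

The printed differences shrink at the rates stated in the theorem as $K$ increases.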

The following Corollary states that the likelihood before and after discreti- 
sation can also be made arbitrarily close with a fine enough discretisation, as 
illustrated in Fig. 9 drawn from Example 2, as described below. 

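Corollary 4.1. Under the conditions of Theorem 4.2, let $\hat\theta_c$ denote the MLE
based on a sample, $x_1, \ldots, x_N$, from $f(x; \theta)$ and $\hat\theta_d$ the MLE for $\{\pi_k(\theta)\}$ based
on the counts $n_k$, $k = 0, \ldots, K(\epsilon)$, for the partition $\{B_k\}_{k=0}^{K(\epsilon)}$ of Theorem 4.2.
Then

$$\| \hat\theta_d - \hat\theta_c \| = O(\epsilon) \tag{4.2}$$

and

$$\left\| \frac{\partial^2 \ell_d}{\partial\theta_r \partial\theta_s}(\hat\theta_d) - \frac{\partial^2 \ell_c}{\partial\theta_r \partial\theta_s}(\hat\theta_c) \right\| = O(\epsilon). \tag{4.3}$$

Proof. See Appendix. □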
The following example illustrates these results and also shows an application
of dimension reduction based on information geometry. Dimension reduction is
dependent on the choice of affine structure. The reduction here is done in the
$+1$-affine geometry, unlike the mixture geometry examples, 1 and 4, where it is
done in the $-1$-geometry.

Example 2 (continued). This example shows how results from information
geometry can be numerically implemented in the resultant curved exponential
family. An example in [17] concerns survival times $Z$ for leukaemia patients
measured in days from the time of diagnosis. Originally from [9], there are 43
observations. For illustrative purposes the data is censored at a fixed value such
that the censored exponential distribution gives a reasonable, but not perfect, fit.
It is assumed the random variable $Z$ has an exponential distribution but only
$Y = \min\{Z, t\}$ is observed. As discussed in [28] this gives a one-dimensional
curved exponential family inside a two dimensional regular exponential family
of the form

$$\exp\left[ \lambda_1 x + \lambda_2 y - \log\left\{ \frac{1}{\lambda_2}\left( e^{\lambda_2 t} - 1 \right) + e^{\lambda_1 + \lambda_2 t} \right\} \right] \tag{4.4}$$

where $y = \min(z, t)$ and $x = I(z > t)$ and the embedding map is given by

$$(\lambda_1(\theta), \lambda_2(\theta)) = (-\log\theta, -\theta).$$
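
As a consistency check on the embedding, the following Python sketch evaluates the log-normaliser of the two dimensional family along the curve. It assumes that the normaliser in (4.4) is $\log\{(e^{\lambda_2 t} - 1)/\lambda_2 + e^{\lambda_1 + \lambda_2 t}\}$, which is what integrating $\exp(\lambda_1 x + \lambda_2 y)$ over $[0, t)$ together with the point mass at $t$ gives. The function names and the rate value are hypothetical.

```python
import math

def psi(lam1, lam2, t):
    """Assumed log-normaliser of the two dimensional family (requires lam2 != 0)."""
    return math.log(math.expm1(lam2 * t) / lam2 + math.exp(lam1 + lam2 * t))

def embed(theta):
    """Embedding of the censored exponential with rate theta."""
    return -math.log(theta), -theta

theta, t = 0.002, 750.0          # hypothetical rate; censoring value as in the example
lam1, lam2 = embed(theta)

# On the curve the normaliser should reduce to -log(theta)
print(psi(lam1, lam2, t), -math.log(theta))

# Probability of a censored observation implied by the family on the curve
p_censored = math.exp(lam1 + lam2 * t - psi(lam1, lam2, t))
print(p_censored, math.exp(-theta * t))
```

On the curve the density of an uncensored observation is then $\theta e^{-\theta y}$ and the censoring probability is $e^{-\theta t}$, as expected for a censored exponential.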



Fig 9. Computational information geometry: likelihood approximation and dimension
reduction (left panel: log-likelihoods; right panel: full exponential family)

Figure 9 shows some of the details of the geometry of the curved exponential 
family which is created after censoring. The censoring value was chosen at 750. 
The parameter of interest is $\mu$, the mean of the uncensored observations. In the
left hand panel of Fig. 9, the solid line is the likelihood function based on binning 
the data to bins of width four days and using a multinomial approximation. The 
dots in this panel are the log-likelihood for the raw data based on the continuous 
censored exponential model. As can be clearly seen there is no real inferential 
loss in the binning and discretisation process. The likelihood plot also shows 
appreciable skewness, which suggests that standard first order asymptotics might 
be improved by the higher order asymptotic methods of classical information 
geometry. 
Fig 10. Sampling distribution of $\hat\mu$ for censored exponential based on saddlepoint
approximation (panel title: Distribution of MLE)

The right hand panel shows the censored exponential (solid curve) embedded
in the two-dimensional full exponential family in the $+1$-parameterization. The
dashed contours are the log-likelihood contours in the full exponential family.
It is clear, even visually, that there is not much $+1$ curvature for this family
on this inferential scale. So this is an example where the curved exponential
family behaves inferentially like a one-dimensional full exponential family. In
particular, the dimension reduction techniques found in [27] can be used. To
see the effectiveness of this idea, Fig. 10 shows how well a saddlepoint based
approximation does at approximating the distribution of the maximum likelihood
estimator of the parameter of interest.

4.3. Higher order asymptotics: saddlepoint method

The saddlepoint approximation method is a very important tool from classical
information geometry, see Fig. 10 for an example. Using this method requires
solving the so-called saddlepoint equation in an efficient and accurate
manner, and for computational information geometry this only needs to be
done numerically. The problem of solving this non-linear equation is tied to
understanding the non-linear relationship between the $+1$ and $-1$-parameters, and
hence the rigorous implementation of numerical methods requires understanding
the global geometry described above. For example, since the issues surrounding
such implementation are far from uniform across the simplex, it helps to
know whether the method is being attempted in a region where first order
asymptotics would work well.
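
A minimal sketch of solving the saddlepoint equation numerically is given below. It is illustrative only: it assumes a distribution on a finite set of bins with hypothetical probabilities, considers the mean of $n$ independent observations, and solves $K'(s) = x$ by bisection, where $K$ is the cumulant generating function. The helper names are not from the paper.

```python
import numpy as np

def tilted_moments(s, support, probs):
    """Return K(s), K'(s), K''(s) for a distribution on a finite support."""
    z = s * support
    shift = z.max()
    w = probs * np.exp(z - shift)          # shifted to avoid overflow
    total = w.sum()
    K = shift + np.log(total)
    tilted = w / total                     # exponentially tilted distribution
    mean = np.sum(tilted * support)
    var = np.sum(tilted * (support - mean) ** 2)
    return K, mean, var

def solve_saddlepoint(x, support, probs, n_bisect=200):
    """Solve K'(s) = x by bisection; x must lie strictly inside the range of the support."""
    lo, hi = -1.0, 1.0
    while tilted_moments(lo, support, probs)[1] > x:
        lo *= 2.0
    while tilted_moments(hi, support, probs)[1] < x:
        hi *= 2.0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if tilted_moments(mid, support, probs)[1] < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def saddlepoint_density(x, n, support, probs):
    """Saddlepoint approximation to the density of the mean of n observations."""
    s_hat = solve_saddlepoint(x, support, probs)
    K, _, var = tilted_moments(s_hat, support, probs)
    return np.sqrt(n / (2.0 * np.pi * var)) * np.exp(n * (K - s_hat * x))

support = np.arange(5.0)
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # hypothetical bin probabilities
s_hat = solve_saddlepoint(3.2, support, probs)
print(s_hat, tilted_moments(s_hat, support, probs)[1])
```

Bisection is used because $K'$ is monotone, so the root is bracketed reliably even when $x$ is close to the edge of the support, where Newton steps can overshoot.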

Example 2 (continued). Example 2 is a curved exponential family, [21]. Consider
Fig. 11, which shows the level sets of the mean parameterization for the
2-dimensional family plotted in the natural parameters. Solving the saddlepoint
equation requires mapping between these two coordinate systems. The figure
illustrates the issues which need considering in implementing numerical methods
to do this. At point 'A' in the figure we see that the level sets of the mean
parameter are becoming close to parallel; this reflects the fact that the Fisher
information can be very close to singular, as discussed in §2.2. At the point 'B'
the bifurcation in the parameters, described in §2.1, is clear. Finally, the point
'C' shows a region where there is close to linearity between the two coordinate
systems, which is typical of when first order asymptotic methods work well, see


Fig 11. Mean parameterization (blue and red lines) plotted in the natural parameters
(panel title: Full exponential family)



5. Inference on Mixtures 

5.1. Lindsay's geometry and the simplex 

This section describes the way the mixture geometry of [25] is related to the 
information geometry of the simplex. In particular, it will lead to extending 
Lindsay's structure in a way which will give considerable computational ad- 
vantages in, for example, computing the non-parametric maximum likelihood 
estimate of a mixture model and understanding its variability. 

Lindsay's geometry lies in an affine space which is determined by the observed
data. In particular, it is always finite dimensional, and the dimension
is determined by the number of distinct observations. Following the notation
of [24], which looks at mixtures of the model $h(y|\theta)$, i.e. models of the form
$f(y; Q) = \int h(y|\theta)\,dQ(\theta)$, let $L_\theta = (L_1(\theta), \ldots, L_{N^*}(\theta))$ represent the $N^*$
distinct likelihood values of $h(y_i|\theta)$ arising from the data $y_1, \ldots, y_n$. The likelihood
on the space of mixtures is defined on the convex hull of the image of the map

$$\theta \mapsto (L_1(\theta), \ldots, L_{N^*}(\theta)) \subset \mathbb{R}^{N^*}.$$
Fig 12. (a) The simplex with a one-dimensional full exponential family (solid) and likelihood
contours (dashed) (b) The image of the simplex under the map $\Pi_L$

Then the non-parametric maximum likelihood estimate, determined
by $Q$, is found by maximising a concave function over this convex set.

There are clear parallels between the convex geometry of Lindsay and the
embedding in the $-1$-simplex. Lindsay's geometry is designed for working with
the likelihood so only concerns the observed data, rather than the full sample
space. For simplicity consider discrete models where the distinct likelihood
components are represented by probabilities $\pi_i$ where, by definition, $i$ lies in the
observed face $\mathcal{P}$ defined in Theorem 2.1 (Section 2.3). The affine structure of
Lindsay is thus determined by the vertices of $\mathcal{P}$, see Fig. 12.

Definition 5.1. Define $\Pi_L$ to be the Euclidean orthogonal projection from a
simplex to the smallest vector space containing the vertices indexed by $\mathcal{P}$.

The following result is strongly connected to Theorem 2.1. In it, the level sets
of the likelihood are now characterised as the pre-images of the mapping $\Pi_L$. It
also shows that searching for the maximum likelihood in the convex hull in the
simplex is the same as in Lindsay's geometry.

Theorem 5.1. (a) The likelihood on the simplex is completely determined by
the likelihood on the image of $\Pi_L$. In particular, all elements of the pre-image
under $\Pi_L$ of a given point have the same likelihood value.

(b) $\Pi_L$ maps $-1$ convex hulls in the $-1$-simplex to the convex hull of Lindsay's
geometry.

Proof. See Appendix. □ 
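
Part (a) can be illustrated with a few lines of Python. The counts and probability vectors below are hypothetical, and the projection is written as zeroing the coordinates of the unobserved bins, which is the orthogonal projection onto the span of the observed vertices.

```python
import numpy as np

counts = np.array([3, 0, 5, 0, 2])     # hypothetical counts; bins 1 and 3 are unobserved
observed = counts > 0                   # index set of the observed face

def loglik(pi):
    """Multinomial log-likelihood; only observed bins contribute."""
    return np.sum(counts[observed] * np.log(pi[observed]))

def project(pi):
    """Orthogonal projection onto the coordinates of the observed bins."""
    return np.where(observed, pi, 0.0)

# Two distributions that differ only in the unobserved bins
pi_a = np.array([0.30, 0.10, 0.40, 0.05, 0.15])
pi_b = np.array([0.30, 0.02, 0.40, 0.13, 0.15])

print(project(pi_a), project(pi_b))
print(loglik(pi_a), loglik(pi_b))
```

The two vectors have the same image under the projection and therefore the same likelihood value.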

Given this result, it is natural to study the likelihood of a convex hull in the 
simplex rather than in Lindsay's space. There are some definite advantages to 
this, some of which will be explored in this paper, while others will only be 
briefly mentioned. In Sections 5.2 and 5.3 a new search algorithm is proposed 
which exploits the information geometry of the simplex. In particular, it exploits 
dimension reduction directly in the simplex to give a direct way of computing 
the non-parametric maximum likelihood estimate. 

A further advantage of working in the simplex is that while Theorem 5.1
shows that Lindsay's geometry captures the $-1$ and likelihood structure, it does
not capture the full information geometry. For example, the expected Fisher
information cannot be represented, since it is defined using the full sample
space, and hence analysis of the variability of the non-parametric maximum
likelihood estimate is more natural in the full simplex, rather than in the
data-dependent space proposed by Lindsay.

5.2. Total positivity and local mixing 

In order to consider dimension reduction in the $-1$ simplex, and the corresponding
dimension of the convex hull, this paper concentrates on the case
where the mixture is over an exponential family. At first sight, Theorem 5.2 and
the comments which follow it may appear contradictory. On the one hand, Theorem 5.2
shows that $-1$-convex hulls of full exponential families have maximal dimension in the
simplex. On the other, the concept of local mixing, and its extension to polytope
approximation in Theorem 5.3, shows that there exist very good low dimensional
approximations to these convex hulls. It is the existence of these low dimensional
approximations which is exploited by the proposed algorithm. Using results on
total positivity, we have

Theorem 5.2. The $-1$-convex hull of an open subset of a generic one dimensional
exponential family is of full dimension.

Proof. See Appendix. □ 

In this result "generic" means that the +1 tangent vector which defines the 
exponential family has components which are all distinct. 

Theorem 5.2 can be contrasted with the results of [26] or [2] which state that,
under regularity and for many applications, mixtures of exponential families
have accurate low dimensional representations. The essential resolution of this
apparent contradiction is that if the segment of the curve $\pi(\theta)$ for $\theta \in \Theta$ lies
'close' to a low dimensional $-1$-affine subspace, then all mixtures over $\Theta$ also
lie 'close' to this space. The following discussion is then concerned with the
appropriate definition of 'close' for modelling purposes.

Motivated by the idea of a local mixture, consider how well a full exponential
family $\pi(\theta)$ can be approximated by a $-1$ polygonal path with vertices $\pi(\theta_i)$,
$i = 1, \ldots, M$. Any point on this polygonal path will have the form

$$\rho\,\pi(\theta_i) + (1 - \rho)\,\pi(\theta_{i+1}) \tag{5.1}$$

with $\rho \in [0, 1]$. Define the segment $S_i := \{\rho\,\pi(\theta_i) + (1 - \rho)\,\pi(\theta_{i+1}) \,|\, \rho \in [0, 1]\}$.
So, on top of the usual label switching identification issue with mixtures, there
is additionally the identification problem induced by

$$\int \{\rho\,\pi(\theta_i) + (1 - \rho)\,\pi(\theta_{i+1})\}\,dQ(\rho) = \int \{\rho\,\pi(\theta_i) + (1 - \rho)\,\pi(\theta_{i+1})\}\,dQ'(\rho) \tag{5.2}$$

when $E_Q(\rho) = E_{Q'}(\rho)$. While lack of identification is usually considered a
statistical problem, computationally it restricts the space the likelihood needs to
be optimised over. It will be shown that restricting attention to this space has
considerable computational advantages.

Consider, then, the following definition and lemma. 

Definition 5.2. Given a norm $\| \cdot \|$, the curve $\pi(\theta)$ and the polygonal path $\cup S_i$,
define the distance function by

$$d(\pi(\theta)) := \inf_{\pi \in \cup S_i} \| \pi(\theta) - \pi \|.$$

Lemma 5.1. If $d(\pi(\theta)) < \epsilon$ for all $\theta$ then any point in the convex hull of $\pi(\theta)$
lies within $\epsilon$ of the convex hull of the finite set $\pi(\theta_i)$.

Proof. By the triangle inequality. □

Let $\hat\pi^{NP}$ be the non-parametric maximum likelihood estimate for mixtures
of the curve $\pi(\theta)$. A consequence of Lemma 5.1 is that, under the uniform
approximation assumption, $\hat\pi^{NP}$ lies within $\epsilon$ of the convex hull of the polygon.
The question is then what norm is appropriate for measuring the quality of the
polygonal approximation.

Definition 5.3. Define the inner product

$$\langle v, w \rangle_\pi := \sum_{i=0}^{k} \frac{v_i w_i}{\pi_i}$$

for $v, w \in V_{mix}$ and $\pi$ such that $\pi_i > 0$ for all $i$. This defines a preferred point
metric as discussed in [10]. Further, let $\| \cdot \|_\pi$ be the corresponding norm.

As motivation for using such a metric, consider the Taylor expansion for the
likelihood around $\hat\pi$ when the maximum is defined by turning point conditions,
i.e. occurs at a point in the interior of the simplex. Under these conditions, to
high order, it follows that

$$\ell(\pi) - \ell(\hat\pi) \approx -\frac{N}{2} \| \pi - \hat\pi \|^2_{\hat\pi}. \tag{5.3}$$

So small dispersions, as measured by $\| \cdot \|_{\hat\pi}$, correspond to small changes in
likelihood values. Note that this is clearly not true under the standard Euclidean
norm, where unbounded changes in likelihood values are possible.
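
The quadratic approximation (5.3) is easy to check numerically. In the sketch below the five frequencies of Table 1 are used purely as illustrative multinomial counts, and the perturbation direction and step size are hypothetical. The helper names are not from the paper.

```python
import numpy as np

counts = np.array([214, 154, 83, 31, 25], dtype=float)   # illustrative counts
N = counts.sum()
pi_hat = counts / N                                       # interior maximiser of the likelihood

def loglik(pi):
    """Multinomial log-likelihood, up to an additive constant."""
    return np.sum(counts * np.log(pi))

def sq_norm(v, pi):
    """Squared preferred point norm: sum_i v_i^2 / pi_i."""
    return np.sum(v**2 / pi)

v = np.array([1.0, -1.0, 1.0, -1.0, 0.0])   # direction whose components sum to zero
t = 1e-3                                     # small step, so pi stays inside the simplex
pi = pi_hat + t * v

exact = loglik(pi) - loglik(pi_hat)
approx = -0.5 * N * sq_norm(pi - pi_hat, pi_hat)
print(exact, approx)
```

The two values agree to within about one percent here, and the agreement improves as the step size decreases.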

Following [25] , the maximum of the likelihood in a convex hull is determined 
by the non-positivity of directional derivatives, rather than turning points. So 
the following likelihood approximation theorem is appropriate. 

Theorem 5.3. Let $\pi(\theta)$ be an exponential family, and $\{\theta_i\}$ a finite and fixed set
of support points such that $d(\pi(\theta)) < \epsilon$ for all $\theta$. Further, denote by $\hat\pi^{NP}$ and $\hat\pi$
the maximum likelihood estimates in the convex hulls of $\pi(\theta)$ and $\{\pi(\theta_i) \,|\, i = 1, \ldots, M\}$
respectively, and by $\hat\pi^{G} := n/N$ the global maximiser in the simplex. Then,

$$\ell(\hat\pi^{NP}) - \ell(\hat\pi) \le \epsilon N \| \hat\pi^{G} - \hat\pi^{NP} \|_{\hat\pi^{NP}} + o(\epsilon). \tag{5.4}$$

Proof. See Appendix. □
Table 1
Observed frequencies of number of dead implants

Number of dead implants     0     1     2     3     4
Frequency                 214   154    83    31    25

5.3. Implementation of Algorithm 

Algorithms using the polygonal approximation technique will be evaluated in 
detail in future work. Here a general outline is given and a couple of examples 
examined (Examples 1 and 4). The fundamental idea is to compute the convex 
hull of a finite number of points on the curve as an approximation to the convex 
hull of the curve itself. The positioning of the points can be decided by using 
singular value decomposition methods to see if the +1 line segment joining 
consecutive points has small enough —1 curvature. From these it is necessary to 
compute e which bounds the uniform approximation of the curve by the polygon 
and then apply Theorem 5.3. 
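
The outline above can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' algorithm: the support points are placed on a fixed grid rather than chosen by singular value decomposition, the data are hypothetical counts on a $Bin(4, p)$ sample space, and a standard EM update for the mixing weights is swapped in as the optimiser over the convex hull of the finite set of points.

```python
import math
import numpy as np

def binomial_curve(p_grid, m):
    """Rows are the points pi(theta_j) on the Bin(m, p) curve, for p in p_grid."""
    i = np.arange(m + 1)
    comb = np.array([math.comb(m, r) for r in i], dtype=float)
    return comb * p_grid[:, None] ** i * (1.0 - p_grid[:, None]) ** (m - i)

def fit_weights(counts, vertices, n_iter=500):
    """EM updates for mixing weights over a fixed, finite set of vertices."""
    N = counts.sum()
    w = np.full(vertices.shape[0], 1.0 / vertices.shape[0])
    history = []
    for _ in range(n_iter):
        mix = w @ vertices                        # current mixture, a point in the simplex
        history.append(np.sum(counts * np.log(mix)))
        w = w * (vertices @ (counts / mix)) / N   # EM step; weights stay on the simplex
    return w, np.array(history)

counts = np.array([30.0, 20.0, 15.0, 20.0, 15.0])          # hypothetical data
vertices = binomial_curve(np.linspace(0.02, 0.98, 25), m=4)
w, history = fit_weights(counts, vertices)

mix = w @ vertices
# Directional derivative of the log-likelihood from the fitted mixture towards each vertex
directional = vertices @ (counts / mix) - counts.sum()
print(history[0], history[-1])
print(directional.max())
```

Because the likelihood is concave in the mixing weights, any convex optimiser could replace the EM step; the directional derivatives give a check on how close the fit is to the maximum over the polytope.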

The first example implements the theorem for a mixture of binomials. 

Example 1 (continued). Consider the data discussed in [22] shown in part
in Table 1. Mixture models are of interest scientifically since the data concerns
frequency of implanted foetuses in laboratory animals, and it could be expected
that there is underlying clustering. Simple plots show over-dispersion relative
to the variance of a fitted binomial model, which implies that a mixture approach
might be appropriate.

Using the polygonal approximation approach allows us to compute easily a
good approximation to the mixture. The result is shown in Fig. 13. The
crosses show the fitted model with circles the data, here with a mixture over
$Bin(\pi, 7)$. We also see the mixing proportions and the directional derivative.




Fig 13. The mixture fit using polygonal approximation 



Note in this example the near perfect fit of the data with the mixture model. 
In terms of the simplex this is easily explained since the maximum likelihood 
estimate in the simplex, in this case, lies inside the convex hull of the binomial 
model. 

Example 4 (continued). For this example, the distribution of the random variables
at all the observed nodes lies in the $2^3 - 1 = 7$ dimensional simplex, parameterized
by the joint probabilities for $(X_1, X_2, X_3)$. If $H$ were observed each node
would be independent, so that conditionally on $H$ this space is 3-dimensional,
and can be parameterized by the marginal probabilities. It is easy to show that the
conditional model includes all 8 vertices of the 7 simplex, intersects six pairs of
opposite edges and three pairs of opposite 2-faces. The full tripod model is a two
component mixture over the three-dimensional full exponential family. Unlike
the full convex hull of Example 1, the two component mixture model need not be
convex in the $-1$-affine space and so can have a complex multimodal likelihood
structure. In order to aid visualisation, we also consider here the corresponding
bipod model, see Fig. 14.




Fig 14. The bipod model: space of unmixed independent distributions showing the ruled-surface 
structure. 

In the tri- and bi-pod examples, the unmixed model can be approximated with 
unions of —1-affine polytopes. These can then be used to compute likelihood 
objects on the two hull and convex hull very efficiently just using convex pro- 
gramming. On each polytope the likelihood has a unique maximum which may, 
or may not, be on its boundary. To see the whole two-hull structure, you just 
need to glue together this finite number of polytopes and their maxima. Local 
maxima in the likelihood correspond to internal maxima in the polytopes. 

To see how to construct these approximating polytopes, consider Figure 14.
The curved surface shown is a so-called ruled surface intersecting the boundary
in two pairs of opposite edges. Choose a finite number of support points on
each edge of the surface and the same number on the opposed edge. Joining
corresponding pairs of points gives a set of $-1$ convex sets, or slices, close to the
surface. Any point in the two hull, that is a convex combination of two points,
lies in the convex polytope which is the convex hull of two of these slices.

6. Discussion and further work 

This paper focused on four main objectives: (1) it showed that extended multinomial
distributions can be used to construct a computational framework demonstrating
commonality between the distinct areas of information geometry, mixture
geometry and the geometry of graphical models, (2) it showed how this
structure allows numerical implementation of results from these areas, (3) it
extended results of information geometry to a simplicial based geometry for
models which do not have a fixed dimension or support, and finally (4) it began
the process of building a computational framework which will act as a proxy for
the 'space of all distributions'.

In continuous examples, a compactness condition is used to keep the underlying
geometry finite. A following paper will look at the case where the compactness
condition is not needed. In this case, infinite dimensional simplexes, and
their closures, are used as the 'space of all distributions'. The extension of classical
information geometry to that setting requires careful consideration of convergence,
which is not required here due to finiteness.

Later work will discuss a variety of statistical inference problems - including 
model selection and model uncertainty - using both these finite and infinite 
frameworks. 
Appendix 1: On the spectral decomposition of the Fisher information

For notational convenience denote $\pi_{(0)} = (\pi_1, \ldots, \pi_k)^T$ so that $1_k^T \pi_{(0)} = 1 - \pi_0$
(i.e. bin 0 is omitted in the $k \times 1$ vector $\pi_{(0)}$) and $\Pi_{(0)} = \mathrm{diag}(\pi_{(0)})$. Without loss,
after permutation, assume $\pi_1 \ge \cdots \ge \pi_k$. Apart from the trivial case $\pi_0 = 1$,
when $I(\pi) := \Pi_{(0)} - \pi_{(0)} \pi_{(0)}^T$ vanishes, its spectral decomposition (SpD) comes
in the following cases.

Case 1 $\pi_l > \pi_{l+1} = \cdots = \pi_k = 0$ for some $0 < l < k$. The SpD of

$$I(\pi) = \begin{pmatrix} \Pi_+ - \pi_+ \pi_+^T & 0 \\ 0 & 0 \end{pmatrix},$$

where $\Pi_+ = \mathrm{diag}(\pi_+)$ and $\pi_+ = (\pi_1, \ldots, \pi_l)^T$, follows from that of
$\Pi_+ - \pi_+ \pi_+^T$ given below.

Case 2 $k = 1$ is trivial.

Case 3 $k > 1$, $\pi_{(0)} = \lambda 1_k$, $\lambda > 0$. The SpD of $I(\pi)$ is

$$\lambda C_k + \lambda (1 - k\lambda) J_k$$

where $C_k = I_k - J_k$ and $J_k = k^{-1} 1_k 1_k^T$. Here $\lambda$ has multiplicity $k - 1$
and eigen-space $[\mathrm{Span}(1_k)]^\perp$, while $\tilde\lambda := \lambda(1 - k\lambda)$ has multiplicity 1 and
eigen-space $\mathrm{Span}(1_k)$. In particular, using $(1 - k\lambda) = \pi_0$,

$$I(\pi) \text{ is singular} \iff \pi_0 = 0.$$
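Case 4 This is the generic case. Denoting by $O_m$ the zero matrix of order $m \times m$,
and by $P(\nu)$ the rank one orthogonal projector onto $\mathrm{Span}(\nu)$, ($\nu \ne 0$), if
$\pi_{(0)} = (\lambda_1 1_{m_1}^T | \cdots | \lambda_g 1_{m_g}^T)^T$, $g > 1$ and $\lambda_1 > \cdots > \lambda_g > 0$, then the SpD
is

$$\sum_{i=1,\, m_i > 1}^{g} \lambda_i\, \mathrm{diag}(O_{m_{i-}}, C_{m_i}, O_{m_{i+}}) + \sum_{i=1}^{g} \tilde\lambda_i\, P\left( \left( \frac{\lambda_1}{\lambda_1 - \tilde\lambda_i} 1_{m_1}^T \,\Big|\, \cdots \,\Big|\, \frac{\lambda_g}{\lambda_g - \tilde\lambda_i} 1_{m_g}^T \right)^T \right)$$

where $m_{i-} = \sum\{m_j \,|\, j < i\}$, $m_{i+} = \sum\{m_j \,|\, j > i\}$ and the $\tilde\lambda_i$ are the zeros
of

$$h(\tilde\lambda) := 1 + \sum_{i=1}^{g} \frac{m_i \lambda_i^2}{\tilde\lambda - \lambda_i} = \left( 1 - \sum_{i=1}^{g} m_i \lambda_i \right) + \tilde\lambda \left( \sum_{i=1}^{g} \frac{m_i \lambda_i}{\tilde\lambda - \lambda_i} \right).$$

In particular, $\{\tilde\lambda_i : i = 1, \ldots, g\}$ are simple eigenvalues satisfying (2.1)
while, whenever $m_i > 1$, $\lambda_i$ is also an eigenvalue having multiplicity $m_i - 1$.
Further, expanding $\det(I(\pi))$, we again find:

$$I(\pi) \text{ is singular} \iff \pi_0 = 0,$$

so that $\tilde\lambda_g > 0 \iff \pi_0 > 0$, as claimed. Finally, we note that each $\tilde\lambda_i$
($i < g$) is typically (much) closer to $\lambda_i$ than to $\lambda_{i+1}$. For, considering the
graph of $x \to 1/x$, $h((\lambda_i + \lambda_{i+1})/2 + \delta(\lambda_i - \lambda_{i+1})/2)$ $(-1 < \delta < +1)$ is
well-approximated by

$$1 - \frac{2 m_i \lambda_i^2}{(\lambda_i - \lambda_{i+1})(1 - \delta)} + \frac{2 m_{i+1} \lambda_{i+1}^2}{(\lambda_i - \lambda_{i+1})(1 + \delta)}$$

whose unique zero $\delta^*$ over $(-1, 1)$ is positive whenever, as will typically be
the case, $m_i = m_{i+1}$ (both will usually be 1) while $(m_i \lambda_i + m_{i+1} \lambda_{i+1}) < 1/2$.
Indeed, a straightforward analysis shows that, for any $m_i$ and $m_{i+1}$,
$\delta^* = 1 + O(\lambda_i)$ as $\lambda_i \to 0$.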
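
The statements of Cases 3 and 4 can be checked numerically. The Python sketch below uses hypothetical probability vectors. For the equal-probability case it compares the eigenvalues with $\lambda$ and $\lambda(1 - k\lambda)$; for the generic case it checks the determinant identity $\det I(\pi) = \pi_0 \prod_i \pi_i$, which follows from the matrix determinant lemma, and the interlacing of the eigenvalues with the bin probabilities.

```python
import numpy as np

def fisher_info(pi_rest):
    """I(pi) = diag(pi_(0)) - pi_(0) pi_(0)^T, with bin 0 omitted."""
    pi_rest = np.asarray(pi_rest, dtype=float)
    return np.diag(pi_rest) - np.outer(pi_rest, pi_rest)

# Case 3: all retained bins share the probability lam, so pi_0 = 1 - k*lam
k, lam = 4, 0.2
eig_equal = np.linalg.eigvalsh(fisher_info(np.full(k, lam)))   # ascending order

# Case 4: distinct probabilities, sorted in decreasing order
pi_rest = np.array([0.4, 0.25, 0.15, 0.1])
pi0 = 1.0 - pi_rest.sum()
info = fisher_info(pi_rest)
eig_generic = np.linalg.eigvalsh(info)[::-1]                    # descending order
det = np.linalg.det(info)

print(eig_equal)
print(eig_generic)
print(det, pi0 * np.prod(pi_rest))
```

The smallest eigenvalue shrinks with $\pi_0$, which is the numerical face of the singularity condition stated above.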

Appendix 2: Proofs 

Proof of Theorem 2.1. (a) Immediate. 

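(b) Let $v \in V_{mix}$ so that $\sum_i v_i = 0$ and write $v$ as $x + y$ where

$$x_i = \begin{cases} v_i & \text{if } i \in \mathcal{Z} \setminus \{k^*\} \\ 0 & \text{if } i \in \mathcal{P} \\ -\sum_{j \in \mathcal{Z} \setminus \{k^*\}} v_j & \text{if } i = k^* \end{cases}$$

and

$$y_i = \begin{cases} 0 & \text{if } i \in \mathcal{Z} \setminus \{k^*\} \\ v_i & \text{if } i \in \mathcal{P} \\ v_{k^*} + \sum_{j \in \mathcal{Z} \setminus \{k^*\}} v_j & \text{if } i = k^* \end{cases}$$

Then, it is immediate that $x$ is in $V^0$ and $y$ is in $V^{k^*}$, the decomposition $v = x + y$
being clearly unique. □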
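Proof of Theorem 4.1. Let $\{B_k\}_{k=0}^{K}$ be any finite measurable partition of $\mathcal{X}$.
Then defining $\pi_k(\theta) := \int_{B_k} f(x; \theta)\,dx$ gives for $i = 1, \ldots, N$, and $k = 0, \ldots, K$,

$$\pi_{k(i)}(\theta) = \int_{B_{k(i)}} \left\{ f(x_i; \theta) + (x - x_i)^T \frac{\partial f}{\partial x}(x^*; \theta) \right\} dx$$

where $x^*$ is a convex combination of $x$ and $x_i$, [3] p. 124, Thm 6-22, and $x_i \in B_{k(i)}$. Thus

$$\left| \pi_{k(i)}(\theta) - f(x_i; \theta) |B_{k(i)}| \right| = \left| \int_{B_{k(i)}} (x - x_i)^T \frac{\partial f}{\partial x}(x^*; \theta)\,dx \right| \le M\, \mathrm{diam}(B_{k(i)})\, |B_{k(i)}|, \tag{A.1}$$

where $|B| := \int_B dx$ and $\mathrm{diam}(B) := \sup_{(x, y) \in B^2} \| x - y \|$.

It is clear that for compact $\mathcal{X}$ there exists a sequence of finite measurable
partitions $B(\delta) = \{B_k(\delta)\}_{k=0}^{K(\delta)}$ such that as $\delta \to 0+$

$$\max_k |B_k(\delta)| \to 0, \qquad \max_k \{\mathrm{diam}(B_k(\delta))\} \to 0.$$

From (A.1) it follows that

$$\pi_{k(i)}(\theta) = f(x_i; \theta) |B_{k(i)}(\delta)| + o(|B_{k(i)}(\delta)|),$$

so that

$$\frac{Lik_d(\theta)}{Lik_d(\theta_0)} = \frac{Lik_c(\theta)}{Lik_c(\theta_0)} \prod_{i=1}^{N} \frac{1 + \dfrac{o(|B_{k(i)}(\delta)|)}{f(x_i; \theta) |B_{k(i)}(\delta)|}}{1 + \dfrac{o(|B_{k(i)}(\delta)|)}{f(x_i; \theta_0) |B_{k(i)}(\delta)|}}. \tag{A.2}$$

Since $f(x_i; \theta)$ is bounded away from zero for all $\theta$, this gives

$$\frac{Lik_d(\theta)}{Lik_d(\theta_0)} = \frac{Lik_c(\theta)(1 + O(\delta))}{Lik_c(\theta_0)(1 + O(\delta))} = \frac{Lik_c(\theta)}{Lik_c(\theta_0)} (1 + O(\delta)),$$

from which the result follows. □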



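Proof of Theorem 4.2. From the uniform continuity of $s(x)$ and the compactness
of $s(\mathcal{X})$, there exists a finite measurable partition $\{B_k\}_{k=0}^{K(\epsilon)}$ such that for
all $k$ and for all $x, y \in B_k$,

$$\| s(x) - s(y) \| < \epsilon. \tag{A.3}$$

It follows from (A.3) that for all $x \in B_k$ and for all $\theta \in \Theta$,

$$\| s(x) - E_\theta(s(X) \,|\, X \in B_k) \| < \epsilon. \tag{A.4}$$

From (A.4) it further follows that

$$Cov_\theta(s_r(X), s_s(X) \,|\, X \in B_k) = O(\epsilon^2) \tag{A.5}$$

and

$$T_{rst}(\theta | k) := E_\theta(t_r t_s t_t \,|\, X \in B_k) = O(\epsilon^3) \tag{A.6}$$

where $t_r := s_r(X) - E_\theta(s_r(X) \,|\, X \in B_k)$.

Further by direct calculation it follows that

$$\frac{\partial}{\partial\theta_r} \log \pi_k(\theta) = E_\theta(s_r(X) \,|\, X \in B_k) - \frac{\partial\psi}{\partial\theta_r}(\theta) \tag{A.7}$$

$$\frac{\partial^2}{\partial\theta_r \partial\theta_s} \log \pi_k(\theta) = Cov_\theta(s_r(X), s_s(X) \,|\, X \in B_k) - \frac{\partial^2\psi}{\partial\theta_r \partial\theta_s}(\theta) \tag{A.8}$$

$$\frac{\partial^3}{\partial\theta_r \partial\theta_s \partial\theta_t} \log \pi_k(\theta) = E_\theta(t_r t_s t_t \,|\, X \in B_k) - \frac{\partial^3\psi}{\partial\theta_r \partial\theta_s \partial\theta_t}(\theta). \tag{A.9}$$

Finally, (a) follows immediately from (A.3) and (A.4), (b) from (A.5) and (A.8),
and (c) from (A.6) and (A.9). □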

Proof of Corollary 4.1. The score equations for $\hat\theta_c$ are
$\frac{\partial\psi}{\partial\theta}(\hat\theta_c) = \frac{1}{N} \sum_{i} s(x_i)$, while
from (A.7) those for $\hat\theta_d$ are

$$\frac{\partial\psi}{\partial\theta}(\hat\theta_d) = \frac{\sum_k n_k E(s(X) \,|\, X \in B_k)}{N}.$$

Using (A.5) and that $\psi'$ has a continuous inverse gives (4.2), while (4.3) follows
from (4.2) and (A.8). □

Proof of Theorem 5.1. (a) The log-likelihood can be written as $\sum_{i \in \mathcal{P}} n_i \log \pi_i$
which is clearly constant for all probability vectors with the same image under
$\Pi_L$ since they share the same elements $\pi_i$, $i \in \mathcal{P}$. (b) Since $\Pi_L$ is linear it
preserves $-1$ convexity. □

Proof of Theorem 5.2. For any $(\pi_i) \in \Delta^k$ with each $\pi_i > 0$, $\theta_0 < \cdots < \theta_k$ and
$s_0 < \cdots < s_k$, let $B = (\pi(\theta_0), \ldots, \pi(\theta_k))$ have general element

\[ \pi_i(\theta_j) := \pi_i \exp[s_i\theta_j - \psi(\theta_j)]. \]

Further, let $\tilde{B} = B - \pi(\theta_0) 1_{k+1}^T$, whose general column is $\pi(\theta_j) - \pi(\theta_0)$. Then, it
suffices to show that $\tilde{B}$ has rank $k$. But, using [19] p. 33, $\mathrm{Rank}(\tilde{B}) = \mathrm{Rank}(B) - 1$,
so that

\[ \mathrm{Rank}(\tilde{B}) = k \iff B \text{ is nonsingular} \iff B^* \text{ is nonsingular}, \]

where $B^* = (\exp[s_i\theta_j])$. It suffices, then, to recall [20] that $K(x, y) = \exp(xy)$
is strictly totally positive (of order $\infty$), so that $\det B^* > 0$.

□ 
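
The rank argument can be checked numerically on a small instance. The sketch below uses made-up values with $k = 3$ and takes $\psi(\theta) = \log \sum_i \pi_i \exp(s_i\theta)$ so that each column of $B$ is a probability vector; that normalisation, the specific numbers and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical illustrative values (not from the paper), with k = 3.
s = np.array([0.0, 1.0, 2.0, 3.0])         # s_0 < ... < s_k
theta = np.array([-1.0, -0.3, 0.4, 1.2])   # theta_0 < ... < theta_k
pi = np.full(4, 0.25)                      # strictly positive base probabilities
k = len(s) - 1

# (B*)_{ij} = exp(s_i * theta_j)
B_star = np.exp(np.outer(s, theta))

# Assumed normaliser: psi(theta_j) = log sum_i pi_i exp(s_i * theta_j)
psi = np.log(pi @ B_star)

# Columns of B are the probability vectors pi(theta_j).
B = pi[:, None] * B_star * np.exp(-psi)[None, :]

# Subtract the first column pi(theta_0) from every column.
B_tilde = B - B[:, [0]]

det_B_star = np.linalg.det(B_star)
rank_B_tilde = np.linalg.matrix_rank(B_tilde)
print(det_B_star, rank_B_tilde)
```

With equally spaced $s_i$ the matrix $B^*$ is a Vandermonde matrix in $e^{\theta_j}$, which gives an independent reason for its determinant to be positive in this particular example.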
Proof of Theorem 5.3. We use a similar expansion to (5.3), adapted to take into
account the fact that the NPMLE is defined by directional derivatives being
non-negative, rather than zero [25].

If $\pi$ is a member of the convex hull of $\pi(\theta)$, then the directional derivative
from $\hat\pi^{NP}$ to $\pi$ is a finite convex combination of elements of the convex cone of
directional derivatives from $\hat\pi^{NP}$ to points in the curve $\pi(\theta)$. For any point $\pi(\theta)$
consider the perturbation from $\hat\pi^{NP}$ of the form

\[ \pi(\lambda) = \hat\pi^{NP} + \lambda(\pi(\theta) - \hat\pi^{NP}). \]



There are two cases to consider: (i) either $\theta$ is a support point of $\hat\pi^{NP}$ or (ii) it
is not.

Case (i) In this case the directional derivatives are zero. Accordingly we can
apply (5.3) directly to have that the change in log-likelihood is $o(\epsilon)$.
Case (ii) In this case, for small enough positive $\lambda$, $\pi(\lambda)$ remains in the convex
hull. Further, the difference in log-likelihood values is then

\[ \sum_i n_i \log(\pi_i(\lambda)) - \sum_i n_i \log(\hat\pi_i^{NP}). \]

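Since the directional derivatives are now non-zero, consider the first order term
in the Taylor expansion of this difference:

\[
\lambda \sum_{i \mid n_i > 0} \left. \frac{n_i}{\pi_i(\lambda)} \right|_{\lambda = 0} (\pi_i(\theta) - \hat\pi_i^{NP})
= \lambda N \sum_{i=0}^{k} \frac{n_i}{N \hat\pi_i^{NP}} (\pi_i(\theta) - \hat\pi_i^{NP})
\le \lambda N \left\| \left( \frac{n_i}{N \hat\pi_i^{NP}} \right) \right\|_2 \left\| \pi(\theta) - \hat\pi^{NP} \right\|_2 .
\]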



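Considering $\lambda$ small enough that

\[ \left\| \pi(\lambda) - \hat\pi^{NP} \right\|_2 = \lambda \left\| \pi(\theta) - \hat\pi^{NP} \right\|_2 \le \epsilon, \]

we have that to first order the change in log-likelihood values for points $\pi(\lambda)$
within $\epsilon$ of $\hat\pi^{NP}$ is bounded by

\[ \epsilon N \left\| \left( \frac{n_i}{N \hat\pi_i^{NP}} \right) \right\|_2 . \]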



So it has been shown that all points in the convex hull of $\pi(\theta)$ which are within
$\epsilon$ of $\hat\pi^{NP}$ satisfy (5.4). From Lemma 5.1 there is at least one point in the convex
hull of the polygon which is within $\epsilon$ of the convex hull. Hence the maximum
likelihood value at $\pi$ also satisfies (5.4). □
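
The first-order bound in Case (ii) rests on the Cauchy-Schwarz inequality, and a small numerical sketch can make that step concrete. It assumes the bound takes the form $\lambda N \|(n_i / (N p_i))\|_2 \|q - p\|_2$ for a base point $p$ and a target $q$; the counts, both probability vectors, and the function names are made up for illustration, and $p$ is only a stand-in for the NPMLE, not an actual fitted value.

```python
import numpy as np

def first_order_term(n, p, q, lam):
    """First-order Taylor term of the log-likelihood change from p towards q."""
    return lam * np.sum(n * (q - p) / p)

def cauchy_schwarz_bound(n, p, q, lam):
    """Upper bound lam * N * ||n / (N p)||_2 * ||q - p||_2 (assumed form)."""
    N = n.sum()
    return lam * N * np.linalg.norm(n / (N * p)) * np.linalg.norm(q - p)

def loglik(n, v):
    """Multinomial log-likelihood, up to an additive constant."""
    return np.sum(n * np.log(v))

# Hypothetical counts and probability vectors.
n = np.array([12.0, 7.0, 3.0, 1.0])
p = np.array([0.40, 0.30, 0.20, 0.10])   # stand-in for the NPMLE
q = np.array([0.55, 0.25, 0.15, 0.05])   # stand-in for a point pi(theta)
lam = 0.05

term = first_order_term(n, p, q, lam)
bound = cauchy_schwarz_bound(n, p, q, lam)
actual = loglik(n, p + lam * (q - p)) - loglik(n, p)
print(term, bound, actual)
```

For small $\lambda$ the exact change in log-likelihood is close to the first-order term, and both sit below the Cauchy-Schwarz bound in this example.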